Home > Computers & Technology > Computer Science > AI & Machine Learning > Natural Language Processing

Data-Intensive Text Processing with MapReduce by Jimmy Lin & Chris Dyer

Author:Jimmy Lin & Chris Dyer , Date: July 4, 2020 ,Views: 229

Data-Intensive Text Processing with MapReduce by Jimmy Lin & Chris Dyer

Author:Jimmy Lin & Chris Dyer
Language: eng
Format: epub
Publisher: Morgan & Claypool Publishers
Published: 2012-07-14T16:00:00+00:00

4.5.1 BYTE-ALIGNED AND WORD-ALIGNED CODES

In most programming languages, an integer is encoded in four bytes and holds a value between 0 and 232 − 1, inclusive. We limit our discussion to unsigned integers, since d-gaps are always positive (and greater than zero). This means that 1 and 4,294,967,295 both occupy four bytes. Obviously, encoding d-gaps this way doesn’t yield any reductions in size.

A simple approach to compression is to only use as many bytes as is necessary to represent the integer. This is known as variable-length integer coding (varInt for short) and accomplished by using the high order bit of every byte as the continuation bit, which is set to one in the last byte and zero elsewhere. As a result, we have 7 bits per byte for coding the value, which means that 0 ≤ n < 27 can be expressed with 1 byte, 27 ≤ n < 214 with 2 bytes, 214 ≤ n < 221 with 3, and 221 ≤ n < 228 with 4 bytes. This scheme can be extended to code arbitrarily large integers (i.e., beyond 4 bytes). As a concrete example, the two numbers:

127, 128

would be coded as such:

1 1111111, 0 0000001 1 0000000

The above code contains two code words, the first consisting of 1 byte, and the second consisting of 2 bytes. Of course, the comma and the spaces are there only for readability. Variable-length integers are byte-aligned because the code words always fall along byte boundaries. As a result, there is never any ambiguity about where one code word ends and the next begins. However, the downside of varInt coding is that decoding involves lots of bit operations (masks, shifts). Furthermore, the continuation bit sometimes results in frequent branch mispredicts (depending on the actual distribution of d-gaps), which slows down processing.

A variant of the varInt scheme was described by Jeff Dean in a keynote talk at the WSDM 2009 conference.10 The insight is to code groups of four integers at a time. Each group begins with a prefix byte, divided into four 2-bit values that specify the byte length of each of the following integers. For example, the following prefix byte:

00,00,01,10

indicates that the following four integers are one byte, one byte, two bytes, and three bytes, respectively. Therefore, each group of four integers would consume anywhere between 5 and 17 bytes. A simple lookup table based on the prefix byte directs the decoder on how to process subsequent bytes to recover the coded integers. The advantage of this group varInt coding scheme is that values can be decoded with fewer branch mispredicts and bitwise operations. Experiments reported by Dean suggest that decoding integers with this scheme is more than twice as fast as the basic varInt scheme.

In most architectures, accessing entire machine words is more efficient than fetching all its bytes separately. Therefore, it makes sense to store postings in increments of 16-bit, 32-bit, or 64-bit machine words. Anh and Moffat [8] presented several word-aligned coding methods, one of which is called Simple-9, based on 32-bit words.

Download

Data-Intensive Text Processing with MapReduce by Jimmy Lin & Chris Dyer.epub

Copyright Disclaimer:
This site does not store any files on its server. We only index and link to content provided by other sites. Please contact the content providers to delete copyright contents if any and email us, we'll remove relevant links or contents immediately.

Categories

Computer Vision & Pattern Recognition	Expert Systems
Intelligence & Semantics	Machine Theory
Natural Language Processing	Neural Networks